13-cs01-analysis

Professor Shannon Ellis

2/23/23

CS01: Right-To-Carry (Analysis)

Q&A

Q: I’m curious about how different groups will approach this dataset differently, and hope that we can have some kind of showcase after this case study is completed!
A: I am too - sometimes students add another (related) question. Some find a new related dataset. Some go really deep on visualization. I like the idea of a showcase! Was not in my plan, but I think it should be!

Q: I’m confused about…how to choose the best EDA.
A: The best EDA is the EDA that best helps you understand the datasets being used for analysis. So, while there’s not “one plot to rule them all,” a great EDA uses visualizations and data summaries that let the analyst really understand the data. Two EDAs can make different decisions and both be “best”.

Course Announcements

Due Dates:

  • Lecture Participation survey “due” after class
  • Lab06 due tomorrow (2/24; 11:59 PM)
  • HW03 due Mon (2/27; 11:59 PM)

Notes:

  • Final Project Groups survey (link also on canvas; “due” Friday)
  • Accept the GH repo for cs01; talk with group mates
  • Midterm grades posted (Canvas) and feedback available (GitHub Issue)
    • Answer key on website
    • Regrades open until Sunday at 5PM (GitHub issue or Campuswire post)
  • CS01 Data Wrangling: Campuswire Post

Agenda

  • panel data & analysis
  • modelling the LOTT and DONOGHUE data
  • Multicollinearity
  • VIF

Questions

  1. What is the relationship between right to carry laws and violence rates in the US?
  2. What is the effect of multicollinearity on coefficient estimates from linear regression models when analyzing right to carry laws and violence rates?

Packages & Data

# i've asked ITS to install those on datahub for you
# if that's not yet complete, you can install using install.packages()
library(tidyverse)
library(broom)
library(plm) 
library(car) 
library(rsample) 
library(GGally) 
library(ggcorrplot) 

This will only work if you finished the wrangling…

load("data/wrangled/wrangled_data_rtc.rda")

Panel Data

Panel Data

  • repeated measures for multiple panel members or individuals over time.
    • multiple units (violent crime and other variables for each state)
    • multiple time points (multiple years)

Lingo:

  • \(N\) individual panel members
  • \(T\) time points
  • Balanced Panels: At each time point ( \(T\) ), there are data points for each individual( \(N\) ). ( \(n = N∗T\) )
  • Unbalanced Panels: May be data points missing for some individuals ( \(N\) ) at some time points ( \(T\) ) ( \(n\) observations \(\lt N∗T\))

Our Panel Data

In our case we have:

  • \(N\) = 44 states (in the data wrangling process we removed those who had adopted an RTC law before 1980)
  • \(T\) = 31 years (1980 - 2010)

Panel is balanced: \(n=44∗31\), thus \(n = 1364\)

Panel Linear Regression

\[Y_{it}=β_{0}+β_{1}X_{1it}+...+β_{K}X_{Kit}+e_{it}\]

  • \(i\) is the individual dimension (in our case individual states)
  • \(t\) is the time dimension.

Notes:

  • Some explanatory variables \(X_{it}\) will vary across individuals and time
  • others will be fixed across the time of the study (or don’t change over time)
  • others still will be fixed across individuals but vary across time periods

❓ Which are examples of variables in our analysis that likely fall into each of these categories?

Fixed Effects Panel Regression Analysis

  • assumes that there are unknown or unobserved unique aspects about the individuals or heterogeneity among individuals (states) \(a_i\) that are not explained by the independent variables but influence the outcome variable of interest (crime).
  • they do not vary with time or in other words are fixed over time but may be correlated with independent variables \(X_{it}\)
  • The intercept can be different for each individual \(\beta_{0i}\) (each state)
  • Other coefficients are assumed to be the same across all the individuals.

In our data…some of the unobserved qualities about the different states may be correlated with some of our independent variables (i.e.level of economic opportunity might be an unobserved feature about the states that influences violent crime rate (outcome) and would be possibly correlated with poverty rate and unemployment (predictors))

Fixed Effects Panel Regression Analysis (Model)

These individual \(a_i\) effects can be correlated with the independent variables \(X\):

\[Y_{it}=\beta_{0}+\beta_{1}X_{1it}+...\beta_{K}X_{Kit}+ a_i +e_{it}\] …or alternatively the individual effects can be absorbed into an individual-specific intercept term \(\beta_{0i}\):

\[Y_{it}=\beta_{0i}+\beta_{1}X_{1it}+...\beta_{k}X_{kit} +e_{it}\]

Panel Linear Model (plm)

  • carry out panel linear models
  • New object: pdata.frame (panel data frame); can indicate:
    • which variable should be used to identify the individuals in our panel (STATE)
    • what variable should be used to identify the time periods in our panel (YEAR)
  • Will be specified in index parameter: index = c("Individual_Variable_NAME", "Time_Period_Variable_NAME")

pdata.frame

d_panel_DONOHUE <- pdata.frame(DONOHUE_DF, index = c("STATE", "YEAR"))
class(d_panel_DONOHUE)
[1] "pdata.frame" "data.frame" 
head(d_panel_DONOHUE, n = 3)
            YEAR  STATE Black_Male_15_to_19_years Black_Male_20_to_39_years
Alaska-1980 1980 Alaska                 0.1670456                 0.9933775
Alaska-1981 1981 Alaska                 0.1732299                 1.0028219
Alaska-1982 1982 Alaska                 0.1737069                 1.0204445
            Other_Male_15_to_19_years Other_Male_20_to_39_years
Alaska-1980                  1.129782                  2.963329
Alaska-1981                  1.124441                  2.974775
Alaska-1982                  1.069821                  3.015071
            White_Male_15_to_19_years White_Male_20_to_39_years
Alaska-1980                  3.627805                  18.28852
Alaska-1981                  3.558261                  18.12821
Alaska-1982                  3.391844                  18.10666
            Unemployment_rate Poverty_rate Viol_crime_count Population
Alaska-1980               9.6          9.6             1919     404680
Alaska-1981               9.4          9.0             2537     418519
Alaska-1982               9.9         10.6             2732     449608
            police_per_100k_lag RTC_LAW_YEAR RTC_LAW TIME_0 TIME_INF
Alaska-1980            194.7218         1995   FALSE   1980     2010
Alaska-1981            200.2299         1995   FALSE   1980     2010
Alaska-1982            191.0553         1995   FALSE   1980     2010
            Viol_crime_rate_1k Viol_crime_rate_1k_log Population_log
Alaska-1980           4.742018               1.556463       12.91085
Alaska-1981           6.061851               1.802015       12.94448
Alaska-1982           6.076404               1.804413       13.01613

Model Considerations: effect

There are three main options for the effect argument:

  1. individual - model for the effect of individual identity
  2. time - model for the effect of time
  3. twoways - meaning modeling for the effect of both individual identity and time
  • speculate that there is an effect of individual STATE identity and time on violent crime rate (effect = "twoways")
    • aka expect some states to have high rates of crime, and others to have low rates of crime
    • expect crime rates to change over time

Model Considerations: model

  1. pooling - standard pooled ordinary least squares regression model
  2. within - fixed effects model (variation between individuals is ignored, model compares individuals to themselves at different periods of time)
  3. between - fixed effects model (variation within individuals from one time point to another is ignored, model compares different individuals at each point of time)
  4. random - random effects (each state has a different intercept but force it to follow a normal distribution - requires more assumptions)

interested in how violence in each state varied over time: within STATE variation (model = within)

Modelling

Model: plm (DONOHUE)

DONOHUE_OUTPUT <- plm(Viol_crime_rate_1k_log ~
                      RTC_LAW +
                      White_Male_15_to_19_years +
                      White_Male_20_to_39_years +
                      Black_Male_15_to_19_years +
                      Black_Male_20_to_39_years +
                      Other_Male_15_to_19_years +
                      Other_Male_20_to_39_years +
                      Unemployment_rate +
                      Poverty_rate +
                      Population_log +
                      police_per_100k_lag,
                      effect = "twoways",
                      model = "within",
                      data = d_panel_DONOHUE)

❓ What is our outcome variable here? What is our primary predictor of interest? Why are all the other variables included?

Model Output: DONOHUE

DONOHUE_OUTPUT_TIDY <- tidy(DONOHUE_OUTPUT, conf.int = 0.95)
DONOHUE_OUTPUT_TIDY
# A tibble: 11 × 7
   term                      estimate std.e…¹ stati…²  p.value conf.low conf.h…³
   <chr>                        <dbl>   <dbl>   <dbl>    <dbl>    <dbl>    <dbl>
 1 RTC_LAWTRUE                2.40e-2 1.70e-2   1.42  1.56e- 1 -9.19e-3  5.73e-2
 2 White_Male_15_to_19_years  1.04e-2 2.82e-2   0.367 7.13e- 1 -4.49e-2  6.56e-2
 3 White_Male_20_to_39_years  2.93e-2 1.00e-2   2.93  3.50e- 3  9.68e-3  4.90e-2
 4 Black_Male_15_to_19_years -5.97e-2 5.78e-2  -1.03  3.02e- 1 -1.73e-1  5.36e-2
 5 Black_Male_20_to_39_years  1.23e-1 1.95e-2   6.34  3.17e-10  8.53e-2  1.62e-1
 6 Other_Male_15_to_19_years  6.74e-1 1.14e-1   5.92  4.15e- 9  4.51e-1  8.97e-1
 7 Other_Male_20_to_39_years -3.04e-1 3.83e-2  -7.95  4.21e-15 -3.79e-1 -2.29e-1
 8 Unemployment_rate         -1.71e-2 4.98e-3  -3.42  6.36e- 4 -2.68e-2 -7.29e-3
 9 Poverty_rate              -7.60e-3 2.99e-3  -2.54  1.12e- 2 -1.35e-2 -1.74e-3
10 Population_log            -2.11e-1 6.17e-2  -3.42  6.55e- 4 -3.32e-1 -8.99e-2
11 police_per_100k_lag        5.59e-4 1.40e-4   4.00  6.72e- 5  2.85e-4  8.33e-4
# … with abbreviated variable names ¹​std.error, ²​statistic, ³​conf.high
DONOHUE_OUTPUT_TIDY$Analysis <- "Donohue"

Formula: as.formula() (LOTT)

LOTT_variables <- LOTT_DF |>
  select(RTC_LAW,
         contains(c("White", "Black", "Other")),
         Unemployment_rate,
         Poverty_rate,
         Population_log,
         police_per_100k_lag) |>
  colnames()

LOTT_fmla <- as.formula(paste("Viol_crime_rate_1k_log ~", paste(LOTT_variables, collapse = " + ")))
LOTT_fmla
Viol_crime_rate_1k_log ~ RTC_LAW + White_Female_10_to_19_years + 
    White_Female_20_to_29_years + White_Female_30_to_39_years + 
    White_Female_40_to_49_years + White_Female_50_to_64_years + 
    White_Female_65_years_and_over + White_Male_10_to_19_years + 
    White_Male_20_to_29_years + White_Male_30_to_39_years + White_Male_40_to_49_years + 
    White_Male_50_to_64_years + White_Male_65_years_and_over + 
    Black_Female_10_to_19_years + Black_Female_20_to_29_years + 
    Black_Female_30_to_39_years + Black_Female_40_to_49_years + 
    Black_Female_50_to_64_years + Black_Female_65_years_and_over + 
    Black_Male_10_to_19_years + Black_Male_20_to_29_years + Black_Male_30_to_39_years + 
    Black_Male_40_to_49_years + Black_Male_50_to_64_years + Black_Male_65_years_and_over + 
    Other_Female_10_to_19_years + Other_Female_20_to_29_years + 
    Other_Female_30_to_39_years + Other_Female_40_to_49_years + 
    Other_Female_50_to_64_years + Other_Female_65_years_and_over + 
    Other_Male_10_to_19_years + Other_Male_20_to_29_years + Other_Male_30_to_39_years + 
    Other_Male_40_to_49_years + Other_Male_50_to_64_years + Other_Male_65_years_and_over + 
    Unemployment_rate + Poverty_rate + Population_log + police_per_100k_lag

Model: plm (LOTT)

d_panel_LOTT <- pdata.frame(LOTT_DF, index = c("STATE", "YEAR"))

LOTT_OUTPUT <- plm(LOTT_fmla,
                   model = "within",
                   effect = "twoways",
                   data = d_panel_LOTT)

Model Output: LOTT

LOTT_OUTPUT_TIDY <- tidy(LOTT_OUTPUT, conf.int = 0.95)
LOTT_OUTPUT_TIDY
# A tibble: 41 × 7
   term                        estimate std.e…¹ stati…²  p.value conf.…³ conf.…⁴
   <chr>                          <dbl>   <dbl>   <dbl>    <dbl>   <dbl>   <dbl>
 1 RTC_LAWTRUE                 -0.0518   0.0162  -3.20  1.39e- 3 -0.0835 -0.0201
 2 White_Female_10_to_19_years  0.636    0.149    4.26  2.24e- 5  0.343   0.929 
 3 White_Female_20_to_29_years  0.00698  0.0670   0.104 9.17e- 1 -0.124   0.138 
 4 White_Female_30_to_39_years  0.261    0.0813   3.21  1.38e- 3  0.101   0.420 
 5 White_Female_40_to_49_years  0.0168   0.0814   0.206 8.37e- 1 -0.143   0.176 
 6 White_Female_50_to_64_years -0.459    0.0625  -7.35  3.60e-13 -0.582  -0.337 
 7 White_Female_65_years_and_…  0.156    0.0469   3.33  9.04e- 4  0.0641  0.248 
 8 White_Male_10_to_19_years   -0.583    0.143   -4.07  4.92e- 5 -0.863  -0.302 
 9 White_Male_20_to_29_years    0.0639   0.0623   1.03  3.05e- 1 -0.0582  0.186 
10 White_Male_30_to_39_years   -0.200    0.0859  -2.33  2.01e- 2 -0.368  -0.0315
# … with 31 more rows, and abbreviated variable names ¹​std.error, ²​statistic,
#   ³​conf.low, ⁴​conf.high
LOTT_OUTPUT_TIDY$Analysis <- "Lott"

RTC coefficient comparison

comparing_analyses <- DONOHUE_OUTPUT_TIDY |>
  bind_rows(LOTT_OUTPUT_TIDY) |>
  filter(term == "RTC_LAWTRUE")

comparing_analyses
# A tibble: 2 × 8
  term        estimate std.error statistic p.value conf.low conf.high Analysis
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl> <chr>   
1 RTC_LAWTRUE   0.0240    0.0170      1.42 0.156   -0.00919    0.0573 Donohue 
2 RTC_LAWTRUE  -0.0518    0.0162     -3.20 0.00139 -0.0835    -0.0201 Lott    

🧠 With each analysis, what would your conclusion be to the question “What is the relationship between right to carry laws and violence rates in the US”?

RTC

ggplot(comparing_analyses) +
  geom_point(aes(x = Analysis, y = estimate)) +
  geom_errorbar(aes(x = Analysis, ymin = conf.low, ymax = conf.high), width = 0.25) +
  geom_hline(yintercept = 0, color = "red") +
  scale_y_continuous(
    breaks = seq(-0.2, 0.2, by = 0.05),
    labels = seq(-0.2, 0.2, by = 0.05),
    limits = c(-0.2, 0.2)
  ) +
  geom_segment(aes(x = 1, y = 0.125, xend = 1, yend = 0.175),
    arrow = arrow(angle = 45, ends = "last", type = "open"),
    size = 2, color = "green", lineend = "butt", linejoin = "mitre"
  ) +
  geom_segment(aes(x = 2, y = -0.125, xend = 2, yend = -0.175),
    arrow = arrow(angle = 45, ends = "last", type = "open"),
    size = 2, color = "red", lineend = "butt", linejoin = "mitre"
  ) +
  theme_minimal() +
  theme(
    axis.title.x = element_blank(),
    axis.text = element_text(size = 8, color = "black")
  ) +
  labs(
    title = "Effect estimate on ln(violent crimes per 100,000 people)",
    y = "  Effect estimate (95% CI)"
  )

Similar Data - different results

  • The only difference between the two data frames rests in how the demographic variables were parameterized.
    • Donohue: Males only; ages 15 to 39
    • Lott: Males/Females; ages 10-65+

…so how did this occur?

Multicollinearity

Multicollinearity

  • multicollinearity occurs when independent variables (predictors) are highly related to one another
  • When seemingly independent variables are highly related to one another, the relationships estimated in an analysis may be distorted.
  • since linear regression aims to determine how a one unit change in a regressor influences a one unit change in the dependent variable, if the regressors are collinear…effect of each regressor cannot be accurately estimated

Diagnosing Multicollinearity: Pairplots

DONOHUE_DF |>
  select(RTC_LAW,
         Viol_crime_rate_1k_log,
         Unemployment_rate,
         Poverty_rate,
         Population_log) |>
  ggpairs(columns = c(2:5),
          lower = list(continuous = wrap("smooth_loess",
                                         color = "red",
                                         alpha = 0.5,
                                         size = 0.1)))

…not much correlation for non-demographic variables - unemployment and poverty rate show some (as expeted)

Diagnosing Multicollinearity: Heatmaps

cor_DONOHUE_dem <- cor(DONOHUE_DF |> select(contains("_years")))

ggcorrplot(cor_DONOHUE_dem,
  tl.cex = 6,
  hc.order = TRUE,
  colors = c(
    "red",
    "white",
    "red"
  ),
  outline.color = "transparent",
  title = "Correlation Matrix: Donohue",
  legend.title = expression(rho)
)

  • strong correlation within race
  • race shows much stronger correlation than age
  • suggests collinearity

Heatmap: Lott

cor_LOTT_dem <- cor(LOTT_DF |> select(contains("_years")))

corr_mat_LOTT <- ggcorrplot(cor_LOTT_dem,
  tl.cex = 6,
  hc.order = TRUE,
  colors = c(
    "red",
    "white",
    "red"
  ),
  outline.color = "transparent",
  title = "Correlation Matrix: Lott",
  legend.title = expression(rho)
)

corr_mat_LOTT

Suggested multicollinearity

…so how do you find out for sure?

  • look at the stability of the coefficient estimates under perturbations of the data
  • we’ll focus on RTC_LAW
  • use resampling (here: remove one observation, see if estimates change == LOOCV - leave one out cross-validation): rsample::loo_cv

LOOCV: splits (Donohue)

## split data
set.seed(124)
DONOHUE_splits <- d_panel_DONOHUE |> loo_cv()
DONOHUE_splits
# Leave-one-out cross-validation 
# A tibble: 1,364 × 2
   splits           id        
   <list>           <chr>     
 1 <split [1363/1]> Resample1 
 2 <split [1363/1]> Resample2 
 3 <split [1363/1]> Resample3 
 4 <split [1363/1]> Resample4 
 5 <split [1363/1]> Resample5 
 6 <split [1363/1]> Resample6 
 7 <split [1363/1]> Resample7 
 8 <split [1363/1]> Resample8 
 9 <split [1363/1]> Resample9 
10 <split [1363/1]> Resample10
# … with 1,354 more rows

LOOCV: get data (Donohue)

DONOHUE_subsets <- map(pull(DONOHUE_splits, splits), training)
glimpse(DONOHUE_subsets[[1]])
Rows: 1,363
Columns: 20
$ YEAR                      <fct> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 19…
$ STATE                     <fct> Alaska, Alaska, Alaska, Alaska, Alaska, Alas…
$ Black_Male_15_to_19_years <pseries> 0.1670456, 0.1732299, 0.1737069, 0.17095…
$ Black_Male_20_to_39_years <pseries> 0.9933775, 1.0028219, 1.0204445, 1.03127…
$ Other_Male_15_to_19_years <pseries> 1.1297816, 1.1244412, 1.0698208, 0.98828…
$ Other_Male_20_to_39_years <pseries> 2.963329, 2.974775, 3.015071, 3.008048, …
$ White_Male_15_to_19_years <pseries> 3.627805, 3.558261, 3.391844, 3.222002, …
$ White_Male_20_to_39_years <pseries> 18.28852, 18.12821, 18.10666, 17.90600, …
$ Unemployment_rate         <pseries> 9.6, 9.4, 9.9, 9.9, 9.8, 9.7, 10.9, 10.3…
$ Poverty_rate              <pseries> 9.6, 9.0, 10.6, 12.6, 9.6, 8.7, 11.4, 12…
$ Viol_crime_count          <pseries> 1919, 2537, 2732, 2940, 3108, 3031, 3046…
$ Population                <pseries> 404680, 418519, 449608, 488423, 513697, …
$ police_per_100k_lag       <pseries> 194.7218, 200.2299, 191.0553, 364.2335, …
$ RTC_LAW_YEAR              <pseries> 1995, 1995, 1995, 1995, 1995, 1995, 1995…
$ RTC_LAW                   <pseries> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ TIME_0                    <pseries> 1980, 1980, 1980, 1980, 1980, 1980, 1980…
$ TIME_INF                  <pseries> 2010, 2010, 2010, 2010, 2010, 2010, 2010…
$ Viol_crime_rate_1k        <pseries> 4.742018, 6.061851, 6.076404, 6.019373, …
$ Viol_crime_rate_1k_log    <pseries> 1.556463, 1.802015, 1.804413, 1.794983, …
$ Population_log            <pseries> 12.91085, 12.94448, 13.01613, 13.09894, …

Function: bootstrapping (Donohue)

fit_nls_on_bootstrap_DONOHUE <- function(subset) {
  plm(Viol_crime_rate_1k_log ~ RTC_LAW +
        White_Male_15_to_19_years +
        White_Male_20_to_39_years +
        Black_Male_15_to_19_years +
        Black_Male_20_to_39_years +
        Other_Male_15_to_19_years +
        Other_Male_20_to_39_years +
        Unemployment_rate +
        Poverty_rate +
        Population_log +
        police_per_100k_lag,
      data = data.frame(subset),
      index = c("STATE", "YEAR"),
      model = "within",
      effect = "twoways")
}

Use the Function (Donohue)

subsets_models_DONOHUE <- map(DONOHUE_subsets, fit_nls_on_bootstrap_DONOHUE)

subsets_models_DONOHUE <- subsets_models_DONOHUE |>
  map(tidy)

subsets_models_DONOHUE[[1]]
# A tibble: 11 × 5
   term                       estimate std.error statistic  p.value
   <chr>                         <dbl>     <dbl>     <dbl>    <dbl>
 1 RTC_LAWTRUE                0.0237    0.0170       1.40  1.62e- 1
 2 White_Male_15_to_19_years  0.0108    0.0282       0.382 7.02e- 1
 3 White_Male_20_to_39_years  0.0294    0.0100       2.93  3.42e- 3
 4 Black_Male_15_to_19_years -0.0596    0.0578      -1.03  3.02e- 1
 5 Black_Male_20_to_39_years  0.123     0.0195       6.34  3.12e-10
 6 Other_Male_15_to_19_years  0.676     0.114        5.94  3.68e- 9
 7 Other_Male_20_to_39_years -0.304     0.0383      -7.94  4.43e-15
 8 Unemployment_rate         -0.0171    0.00498     -3.43  6.14e- 4
 9 Poverty_rate              -0.00759   0.00299     -2.53  1.14e- 2
10 Population_log            -0.211     0.0617      -3.42  6.43e- 4
11 police_per_100k_lag        0.000556  0.000140     3.98  7.32e- 5

Note: This code takes a while to run. Suggestion to cache this code chunk!

Save bootstrap output (Donohue)

save(subsets_models_DONOHUE,
  file = "data/wrangled/DONOHUE_simulations.rda")

LOOCV: splits (Lott)

set.seed(124)
LOTT_splits <- d_panel_LOTT |> loo_cv()
LOTT_subsets <- map(pull(LOTT_splits, splits), training)

Function: bootstrapping (Lott)

fit_nls_on_bootstrap_LOTT <- function(split) {
  plm(LOTT_fmla,
      data = data.frame(split),
      index = c("STATE", "YEAR"),
      model = "within",
      effect = "twoways"
  )
}

Use the Function (Lott)

subsets_models_LOTT <- map(LOTT_subsets, fit_nls_on_bootstrap_LOTT)
subsets_models_LOTT <- subsets_models_LOTT |>
  map(tidy)

Note: This code takes a while to run. Suggestion to cache this code chunk!

Save bootstrap output (Lott)

save(subsets_models_LOTT,
  file = "data/wrangled/LOTT_simulations.rda")

Visualize the results

names(subsets_models_DONOHUE) <- paste0("DONOHUE_", seq_len(length(subsets_models_DONOHUE)))

names(subsets_models_LOTT) <-
  paste0("LOTT_", 1:length(subsets_models_LOTT))

Combine simulation data

simulations_DONOHUE <- subsets_models_DONOHUE |>
  bind_rows(.id = "ID") |>
  mutate(Analysis = "Donohue")

simulations_LOTT <- subsets_models_LOTT |>
  bind_rows(.id = "ID") |>
  mutate(Analysis = "Lott")

simulations <- bind_rows(simulations_DONOHUE, simulations_LOTT)

# order for easier comparison
simulations <- simulations |>
  mutate(term = factor(term,
    levels = c(
      str_subset(unique(pull(simulations, term)), "years", negate = TRUE),
      sort(str_subset(unique(pull(simulations, term)), "years")))))
head(simulations)
# A tibble: 6 × 7
  ID        term                      estimate std.er…¹ stati…²  p.value Analy…³
  <chr>     <fct>                        <dbl>    <dbl>   <dbl>    <dbl> <chr>  
1 DONOHUE_1 RTC_LAWTRUE                 0.0237   0.0170   1.40  1.62e- 1 Donohue
2 DONOHUE_1 White_Male_15_to_19_years   0.0108   0.0282   0.382 7.02e- 1 Donohue
3 DONOHUE_1 White_Male_20_to_39_years   0.0294   0.0100   2.93  3.42e- 3 Donohue
4 DONOHUE_1 Black_Male_15_to_19_years  -0.0596   0.0578  -1.03  3.02e- 1 Donohue
5 DONOHUE_1 Black_Male_20_to_39_years   0.123    0.0195   6.34  3.12e-10 Donohue
6 DONOHUE_1 Other_Male_15_to_19_years   0.676    0.114    5.94  3.68e- 9 Donohue
# … with abbreviated variable names ¹​std.error, ²​statistic, ³​Analysis

Simulation: Plot (coefficients)

simulations |>
  ggplot(aes(x = term, y = estimate)) +
  geom_boxplot() +
  facet_grid(. ~ Analysis, scale = "free_x", space = "free", drop = TRUE) +
  labs(title = "Coefficient estimates",
       subtitle = "Estimates across leave-one-out analyses",
       x = "Term",
       y = "Coefficient",
       caption = "Results from simulations") +
  theme_linedraw() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 70, hjust = 1),
        strip.text.x = element_text(size = 14, face = "bold"),
        plot.title.position="plot")

Simulation: Plot (sd)

coeff_sd <- simulations |>
  group_by(Analysis, term) |>
  summarize("SD" = sd(estimate))

coeff_sd |>
  ggplot(aes(x = Analysis, y = SD)) +
  geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
  labs(title = "Coefficient variability",
       subtitle = "SDs of coefficient estimates from leave-one-out analysis",
       x = "Term",
       y = "Coefficient Estimate \n Standard Deviations",
       caption = "Results from simulations") +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(size = 8, color = "black"),
        axis.text.y = element_text(color = "black"),
        plot.title.position="plot")

VIF: Variance Inflation Factor

VIF: Evaluating presence and severity of multicollinearity

Variance Inflation Factor (VIF): index that measures how much the variance (the square of the estimate’s standard deviation) of an estimated regression coefficient is increased because of collinearity.

Calculating VIF:

If our model is : \(Y = β_0 + β_1X_1 + β_2X_2 + β_3X_3 + e\), then:

To calculate the VIF value for \(X_1\), perform another OLS model, where \(X_1\) is now the dependent variable explained by the other explanatory variables:

\[X_1 = β_0 + β_2X_2 + β_3X_3 + e\]

From this model : \[VIF = \frac{1}{1-R^{2}}\]

…where \(R^2\) value is the proportion of variance in \(X_1\) explained by the other variables ( \(X_2\) and \(X_3\))

Repeat for each explanatory variable in the model.

. . . - the larger the VIF, the more multicollinearity - VIF of >=10 typically indicates problematic multicollinearity

Calculating VIF (Donohue)

# create model matrix
lm_DONOHUE_data <- as.data.frame(model.matrix(DONOHUE_OUTPUT))

# define model
lm_DONOHUE_data <- lm_DONOHUE_data |> 
  mutate(Viol_crime_rate_1k_log = plm::Within(pull(
    d_panel_DONOHUE, Viol_crime_rate_1k_log
  )), effect = "twoways")

# specify model
lm_DONOHUE <- lm(Viol_crime_rate_1k_log ~
RTC_LAWTRUE +
  White_Male_15_to_19_years +
  White_Male_20_to_39_years +
  Black_Male_15_to_19_years +
  Black_Male_20_to_39_years +
  Other_Male_15_to_19_years +
  Other_Male_20_to_39_years +
  Unemployment_rate +
  Poverty_rate +
  Population_log +
  police_per_100k_lag,
data = lm_DONOHUE_data
)

# calculate VIF
vif_DONOHUE <- vif(lm_DONOHUE)
# combine into nice table
vif_DONOHUE <- vif_DONOHUE |>
  as_tibble() |>
  cbind(names(vif_DONOHUE)) |>
  as_tibble()
colnames(vif_DONOHUE) <- c("VIF", "Variable")

vif_DONOHUE |>
  arrange(desc(VIF))
# A tibble: 11 × 2
     VIF Variable                 
   <dbl> <chr>                    
 1  1.72 White_Male_20_to_39_years
 2  1.66 Black_Male_20_to_39_years
 3  1.58 Other_Male_15_to_19_years
 4  1.52 Other_Male_20_to_39_years
 5  1.34 Black_Male_15_to_19_years
 6  1.27 Poverty_rate             
 7  1.23 Unemployment_rate        
 8  1.21 police_per_100k_lag      
 9  1.17 Population_log           
10  1.15 White_Male_15_to_19_years
11  1.11 RTC_LAWTRUE              

Calculating VIF (Lott)

lm_LOTT_data <- as.data.frame(model.matrix(LOTT_OUTPUT))
lm_LOTT_data <- lm_LOTT_data |>
  mutate(Viol_crime_rate_1k_log = plm::Within(pull(d_panel_LOTT, Viol_crime_rate_1k_log), effect = "twoways")) |>
  rename(RTC_LAW = RTC_LAWTRUE)
lm_LOTT <- lm(LOTT_fmla, data = lm_LOTT_data)

vif_LOTT <- vif(lm_LOTT)
vif_LOTT <- vif_LOTT |>
  as_tibble() |>
  cbind(names(vif_LOTT)) |>
  as_tibble()
colnames(vif_LOTT) <- c("VIF", "Variable")
# clean up names
vif_LOTT |> 
  mutate(Variable = str_replace(string = Variable,
                                pattern = "RTC_LAW",
                                replacement = "RTC_LAWTRUE")) |>
  arrange(desc(VIF))
# A tibble: 41 × 2
     VIF Variable                   
   <dbl> <chr>                      
 1  342. Black_Female_10_to_19_years
 2  327. Black_Male_10_to_19_years  
 3  251. Other_Male_40_to_49_years  
 4  227. Other_Female_40_to_49_years
 5  177. Other_Male_50_to_64_years  
 6  157. Other_Male_10_to_19_years  
 7  148. Other_Female_10_to_19_years
 8  134. Other_Female_50_to_64_years
 9  121. White_Female_10_to_19_years
10  120. White_Male_10_to_19_years  
# … with 31 more rows

VIF Across Analyses

vif_DONOHUE <- vif_DONOHUE |>
  mutate(Analysis = "Donohue")
vif_LOTT <- vif_LOTT |>
  mutate(Analysis = "Lott")
vif_df <- bind_rows(vif_DONOHUE, vif_LOTT)

VIF Plot

vif_df |>
  ggplot(aes(x = Analysis, y = VIF)) +
  geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
  geom_hline(yintercept = 10, color = "red") +
  geom_text(aes(.75, 13, label = "typical cutoff of 10")) +
  coord_trans(y = "log10") +
  labs(title = "Variance inflation factors") +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(color = "black"),
        axis.text.y = element_text(color = "black"))

What to do?

  • different model?
  • different predictors in this model?
  • different parameterization of predictors?
  • article for a detailed discussion about what to consider when your model has variables with high VIF values

Where to Go From Here

  • complete the lab!
  • start to work with your group through this analysis
  • consider what model you’ll want to use to answer the questions
  • think about/consider extension for CS01

Suggested Reading